Introductory Data Science for Innovation (995N1) – Week 8, 15 November 2021
“[…] text mining seeks to extract useful information from data sources through the identification and exploration of interesting patterns […]”
“[…] data sources are document collections, and interesting patterns are found not among formalized database records but in the unstructured textual data in the documents in these collections” (Feldman and Sanger 2006)
“Text mining represents the ability to take large amounts of unstructured language and quickly extract useful and novel insights that can affect stakeholder decision‐making” (Kwartler 2017)
Source: Examining linguistic and cultural phenomena (1800-2000) in a sample of 5 million books (Michel et al. 2011)
Source: A sample of 150 years of articles published in British periodicals (Lansdall-Welfare et al. 2017)
Source: Research priorities and societal demands (Ciarli and Ràfols 2019)
Source: Convergence of industries using 2 million newspaper articles from 1989 to 2012 (IC = Industry convergence index, based on the co-occurrence of industries within a sentence) (Kim et al. 2015)
Source: Emergence of mass production in the text of Scientific American (1845-1995) (Bone and Rotolo 2020)
tidytext package in R. Source: https://www.tidytextmining.com
Source: The Guardian, 12 November 2021
“On Friday, youth campaigners, indigenous leaders and Extinction Rebellion members raised a cacophony of chants and drum beats outside Cop26, as civil society groups inside the conference complex staged a walkout to join them. Within the UN-controlled blue zone, delegates darted through the endless meeting halls, or hunched around laptops, as the clock counted down tense minutes to the end of the 12-day conference that is widely understood to be crucial to the future of humanity. The ragged final hours of Cop26 are a distinct contrast to the carefully choreographed first days, when world leaders arrived with bustling entourages to deliver a flourish of eye-catching pledges and, in the case of Boris Johnson, eye-watering metaphors, as the host nation’s prime minister proffered a string of clumsy analogies, likening the climate crisis to a football game and then a James Bond movie in his welcome address….”
Wickham and Grolemund (2017) describe tidy data as data where each variable has its own column, each observation has its own row, and each value has its own cell. In tidy text analysis, this translates into a table with one token per row (Silge and Robinson 2017).
Part-of-Speech (POS) tagging
Full parsing
The objective is to perform a full syntactic analysis of a sentence. Two main families of grammars are used:
Constituency grammars
Dependency grammars
Shallow parsing
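tidytext itself does not parse sentences. As an illustration only, here is a minimal sketch of POS tagging and dependency parsing using the udpipe package and its pre-trained English model; both the package and the model download are assumptions, not part of this module's materials:

```r
library(udpipe)

# Download and load a pre-trained English model (assumption: internet access)
model_info <- udpipe_download_model(language = "english")
ud_model   <- udpipe_load_model(model_info$file_model)

# Annotate a sentence: tokens, POS tags, and dependency relations
ann <- udpipe_annotate(ud_model,
                       x = "Delegates darted through the endless meeting halls.")
as.data.frame(ann)[, c("token", "upos", "head_token_id", "dep_rel")]
```

The resulting data frame reports, for each token, its universal POS tag (`upos`) and the token it depends on (`head_token_id`, `dep_rel`), i.e. a dependency-grammar parse.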
In tidytext, the function unnest_tokens enables us to tokenise the text. We first load the readr, tidyverse, tidytext, and ggplot2 packages:
library(readr)
library(ggplot2)
library(tidyverse)
library(tidytext)
my_text <- read_csv("news_articles_example.csv") %>%
select(id, Title)
print(my_text, n = 6)
## # A tibble: 692 × 2
##      id Title
##   <dbl> <chr>
## 1     1 The unreal spectacle of COP26;COP26 has had a prepackaged feel to it, a…
## 2     2 Climate activists send a message for COP26
## 3     3 COP26: Nicola Sturgeon urges Boris Johnson to return and use position t…
## 4     4 COP26: Nicola Sturgeon urges Boris Johnson to return and use position t…
## 5     5 COP26: Police Scotland arrested eight people on penultimate day of Glas…
## 6     6 COP26: Top 10 bizzare moments of the Glasgow climate talks ranked
## # … with 686 more rows
We tokenise each title into unigrams (single words)
my_text_uni <- my_text %>%
  unnest_tokens(output = word, input = Title)
print(my_text_uni, n = 6)
## # A tibble: 7,925 × 2
##      id word
##   <dbl> <chr>
## 1     1 the
## 2     1 unreal
## 3     1 spectacle
## 4     1 of
## 5     1 cop26
## 6     1 cop26
## # … with 7,919 more rows
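Note that unnest_tokens converts tokens to lower case by default (“COP26” becomes “cop26” above); the original capitalisation can be kept with the to_lower argument:

```r
# Keep the original case when tokenising
my_text %>%
  unnest_tokens(output = word, input = Title, to_lower = FALSE)
```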
We could also tokenise into n-grams, for example bigrams
my_text_bi <- my_text %>%
unnest_tokens(output = bigram, input = Title,
token = "ngrams", n = 2)
print(my_text_bi, n = 6)
## # A tibble: 7,238 × 2
##      id bigram
##   <dbl> <chr>
## 1     1 the unreal
## 2     1 unreal spectacle
## 3     1 spectacle of
## 4     1 of cop26
## 5     1 cop26 cop26
## 6     1 cop26 has
## # … with 7,232 more rows
… or trigrams
my_text_tri <- my_text %>%
unnest_tokens(output = trigram, input = Title,
token = "ngrams", n = 3)
print(my_text_tri, n = 6)
## # A tibble: 6,557 × 2
##      id trigram
##   <dbl> <chr>
## 1     1 the unreal spectacle
## 2     1 unreal spectacle of
## 3     1 spectacle of cop26
## 4     1 of cop26 cop26
## 5     1 cop26 cop26 has
## 6     1 cop26 has had
## # … with 6,551 more rows
… or sequences of characters (five in the case below; note that spaces and punctuation are dropped before the shingles are formed)
my_text_char <- my_text %>%
unnest_tokens(output = character_shingles, input = Title,
token = "character_shingles", n = 5)
print(my_text_char, n = 6)
## # A tibble: 36,769 × 2
##      id character_shingles
##   <dbl> <chr>
## 1     1 theun
## 2     1 heunr
## 3     1 eunre
## 4     1 unrea
## 5     1 nreal
## 6     1 reals
## # … with 36,763 more rows
We can tag parts of speech by joining the tokens with the parts_of_speech dictionary bundled with tidytext (a word with several possible tags is matched more than once, hence the larger number of rows):

my_text_uni_pos <- my_text_uni %>%
  left_join(parts_of_speech, by = "word")
print(my_text_uni_pos, n = 6)
## # A tibble: 15,047 × 3
##      id word      pos
##   <dbl> <chr>     <chr>
## 1     1 the       Definite Article
## 2     1 the       Adverb
## 3     1 unreal    Adjective
## 4     1 spectacle Noun
## 5     1 of        Noun
## 6     1 of        Preposition
## # … with 15,041 more rows
my_text_uni_pos <- my_text_uni_pos %>%
  group_by(pos) %>%
  count(pos, sort = T)
print(my_text_uni_pos, n = 6)
## # A tibble: 13 × 2
## # Groups: pos [13]
##   pos                       n
##   <chr>                 <int>
## 1 Noun                   4953
## 2 <NA>                   1521
## 3 Adverb                 1456
## 4 Verb (usu participle)  1449
## 5 Preposition            1395
## 6 Adjective              1175
## # … with 7 more rows
Let’s focus on unigrams and count the number of words
my_text_uni_count <- my_text_uni %>%
  count(word, sort = T)
print(my_text_uni_count, n = 6)
## # A tibble: 2,436 × 2
##   word        n
##   <chr>   <int>
## 1 to        300
## 2 the       188
## 3 of        161
## 4 cop26     160
## 5 on        148
## 6 climate   135
## # … with 2,430 more rows
Basic descriptive statistics
summarise(my_text_uni_count,
num_words = n(),
mean = mean(n),
sd = sd(n),
min = min(n),
max = max(n))
## # A tibble: 1 × 5
##   num_words  mean    sd   min   max
##       <int> <dbl> <dbl> <int> <int>
## 1      2436  3.25  11.1     1   300
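As a quick sanity check, the mean is simply the total number of tokens divided by the number of distinct words:

```r
# 7,925 tokens spread over 2,436 distinct words
7925 / 2436   # ≈ 3.25, matching the mean reported above
```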
We can plot the most frequent words, i.e. those in the top 1% of the frequency distribution
min_occur <- quantile(my_text_uni_count$n, 0.99)
g <- my_text_uni_count %>%
  filter(n >= min_occur) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n, fill = word)) +
  geom_col() +
  theme(legend.position = "none") +
  coord_flip()
Function words (“to”, “the”, “of”) dominate the ranking, so we remove them using the stop_words dataset bundled with tidytext:

data(stop_words)
my_text_uni <- my_text_uni %>%
anti_join(stop_words)
my_text_uni_count <- my_text_uni %>%
count(word, sort = T)
g <- my_text_uni_count %>%
filter(n >= min_occur) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n, fill = word)) +
geom_col() +
theme(legend.position = "none") +
coord_flip()
Some tokens are numbers; we can flag them by attempting a numeric conversion:

my_text_uni <- my_text_uni %>%
mutate(word_numeric = as.numeric(word))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
Rows where word_numeric is NA are actual words rather than numbers, so we keep only those:

my_text_uni <- my_text_uni %>%
  filter(is.na(word_numeric))
my_text_uni_count <- my_text_uni %>%
  count(word, sort = T)
min_occur <- quantile(my_text_uni_count$n, 0.98)
g <- my_text_uni_count %>%
  filter(n >= min_occur) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n, fill = word)) +
  geom_col() +
  theme(legend.position = "none") +
  coord_flip()
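The warning comes from as.numeric(): any string that is not a number is coerced to NA, which is exactly the behaviour we exploit to drop numeric tokens:

```r
# Non-numeric strings become NA under coercion
suppressWarnings(as.numeric(c("climate", "26", "2021")))
# → NA 26 2021
```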
We can also define and remove our own list of stop words:

my_stop_words <- tibble(word = c("cop26", "final", "news"),
lexicon = "mywords")
my_text_uni <- my_text_uni %>%
anti_join(my_stop_words)
g <- my_text_uni %>%
count(word, sort = T) %>%
filter(n >= min_occur) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n, fill = word)) +
geom_col() +
theme(legend.position = "none") +
coord_flip()
We can use the ngramr package to gather data from the Google Books Ngram Viewer:

library(ngramr)
library(ggplot2)
googlebook_data <- ngram(c("science", "technology", "innovation"),
smoothing = 0, year_start = 1900)
g <- ggplot(googlebook_data,
aes(x = Year, y = Frequency, colour= Phrase)) +
geom_line()
We can lemmatise the words with the textstem package:

library(textstem)
my_text_uni <- my_text_uni %>%
  mutate(word_lemma = textstem::lemmatize_words(word))
my_text_uni_count <- my_text_uni %>%
  count(word_lemma, sort = T)
min_occur <- quantile(my_text_uni_count$n, 0.95)
g <- my_text_uni_count %>%
  filter(n >= min_occur) %>%
  mutate(word_lemma = reorder(word_lemma, n)) %>%
  ggplot(aes(x = word_lemma, y = n, fill = word_lemma)) +
  geom_col() +
  theme(legend.position = "none") +
  coord_flip()
Alternatively, we can stem the words with the SnowballC package:

library(SnowballC)
my_text_uni <- my_text_uni %>%
  mutate(word_stem = wordStem(word))
my_text_uni_count <- my_text_uni %>%
  count(word_stem, sort = T)
min_occur <- quantile(my_text_uni_count$n, 0.95)
g <- my_text_uni_count %>%
  filter(n >= min_occur) %>%
  mutate(word_stem = reorder(word_stem, n)) %>%
  ggplot(aes(x = word_stem, y = n, fill = word_stem)) +
  geom_col() +
  theme(legend.position = "none") +
  coord_flip()
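Lemmatisation and stemming can produce quite different tokens: the lemmatiser maps words to dictionary forms, while the Porter stemmer mechanically strips suffixes. A small comparison (assuming both packages are installed):

```r
words <- c("parties", "activists", "ranked")
textstem::lemmatize_words(words)   # dictionary forms, e.g. "party"
SnowballC::wordStem(words)         # truncated stems, e.g. "parti"
```

Stems need not be real words, which is why stemmed frequency plots can look odd compared with lemmatised ones.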
Now let's apply the same cleaning steps to the article text (the Hlead field):

my_text_uni <- read_csv("news_articles_example.csv") %>%
select(id, Hlead) %>%
unnest_tokens(output = word, input = Hlead) %>%
anti_join(stop_words) %>%
mutate(word_numeric = as.numeric(word)) %>%
filter(is.na(word_numeric)) %>%
select(-word_numeric)
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
We can calculate the term frequency (tf) of each word by document
article_words <- my_text_uni %>%
  count(id, word, sort = T) %>%
  ungroup()
head(article_words)
## # A tibble: 6 × 3
##      id word        n
##   <dbl> <chr>   <int>
## 1   659 pwa        84
## 2   659 editor     70
## 3   659 bar        66
## 4   170 water      57
## 5   124 parties    52
## 6   124 paris      50
total_words <- article_words %>%
group_by(id) %>%
summarize(total = sum(n))
print(total_words, n = 6)
## # A tibble: 692 × 2
##      id total
##   <dbl> <int>
## 1     1   381
## 2     2   378
## 3     3   558
## 4     4   558
## 5     5   346
## 6     6   722
## # … with 686 more rows
We remove documents that contain only a single word
article_words <- left_join(article_words, total_words) %>%
  filter(total > 1) %>%
  mutate(word_freq = n/total)
print(article_words, n = 6)
## # A tibble: 131,673 × 5
##      id word        n total word_freq
##   <dbl> <chr>   <int> <int>     <dbl>
## 1   659 pwa        84  1168    0.0719
## 2   659 editor     70  1168    0.0599
## 3   659 bar        66  1168    0.0565
## 4   170 water      57  1053    0.0541
## 5   124 parties    52  2040    0.0255
## 6   124 paris      50  2040    0.0245
## # … with 131,667 more rows
Let’s look at the term frequency distribution of four random articles
g <- article_words %>%
  filter(id == 107 | id == 175 | id == 288 | id == 631) %>%
  ggplot(aes(word_freq, fill = as.character(id))) +
  geom_histogram() +
  xlim(NA, 0.05) +
  theme(legend.position = "none") +
  facet_wrap(~id, ncol = 4)
In tidytext, the function bind_tf_idf calculates tf-idf scores starting from a tidy text dataset: one row per token per document, including the term counts. Let's start by reading the data again:
article_words_tfidf <- read_csv("news_articles_example.csv") %>%
select(id, Hlead) %>%
unnest_tokens(output = word, input = Hlead) %>%
anti_join(stop_words) %>%
mutate(word_numeric = as.numeric(word)) %>%
filter(is.na(word_numeric)) %>%
select(-word_numeric) %>%
count(id, word, sort = T) %>%
ungroup() %>%
bind_tf_idf(word, id, n)
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
print(article_words_tfidf, n = 6)
## # A tibble: 131,673 × 6
##      id word        n     tf   idf tf_idf
##   <dbl> <chr>   <int>  <dbl> <dbl>  <dbl>
## 1   659 pwa        84 0.0719  6.54 0.470
## 2   659 editor     70 0.0599  2.93 0.176
## 3   659 bar        66 0.0565  4.14 0.234
## 4   170 water      57 0.0541  2.29 0.124
## 5   124 parties    52 0.0255  2.26 0.0577
## 6   124 paris      50 0.0245  1.56 0.0381
## # … with 131,667 more rows
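We can verify a score by hand: tf is the word's share of its document, idf = ln(number of documents / number of documents containing the word), and tf-idf is their product. "pwa" appears in only one of the 692 articles:

```r
idf_pwa <- log(692 / 1)   # ≈ 6.54, as in the table above
0.0719 * idf_pwa          # tf * idf ≈ 0.470
```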
Let’s look at top 15 words by tf-idf in the case of four random articles
g <- article_words_tfidf %>%
filter(id == 107 | id == 175 | id == 288 | id == 631) %>%
group_by(id) %>%
top_n(15, tf_idf) %>%
ungroup %>%
mutate(id = as.factor(id),
word = reorder_within(word, tf_idf, id)) %>%
ggplot(aes(word, tf_idf, fill = id)) +
geom_col() +
facet_wrap(~id, nrow = 2, scales = "free_y") +
coord_flip() +
theme(legend.position = "none") +
scale_x_reordered()
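The reorder_within()/scale_x_reordered() pair is needed because the same word can rank differently in different facets: reorder_within() appends the facet value to each label (with "___" as the default separator) so that bars can be ordered independently within each facet, and scale_x_reordered() strips the suffix from the axis labels again. A minimal illustration:

```r
library(tidytext)
# Each label is suffixed with its group, so ordering is per group
reorder_within(x = c("water", "paris"), by = c(2, 1), within = c("a", "a"))
```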
To remove stop words from bigrams, we separate() each bigram into its two words, filter out the stop words, and then unite() the bigrams again:

article_words_tfidf <- read_csv("news_articles_example.csv") %>%
select(id, Hlead) %>%
unnest_tokens(output = bigram, input = Hlead,
token = "ngrams", n = 2) %>%
separate(bigram, c("word_1", "word_2"), sep = " ") %>%
filter(!word_1 %in% stop_words$word) %>%
filter(!word_2 %in% stop_words$word) %>%
unite(bigram, word_1, word_2, sep = " ") %>%
count(id, bigram, sort = T) %>%
bind_tf_idf(bigram, id, n)
print(article_words_tfidf, n = 6)
## # A tibble: 73,776 × 6
##      id bigram              n     tf   idf tf_idf
##   <dbl> <chr>           <int>  <dbl> <dbl>  <dbl>
## 1   659 editor bar         64 0.0706  6.54 0.461
## 2   124 paris agreement    44 0.0486  1.89 0.0917
## 3   659 pwa editor         34 0.0375  6.54 0.245
## 4   659 data pwa           32 0.0353  6.54 0.231
## 5   659 bar panel          30 0.0331  6.54 0.216
## 6   248 words photos       20 0.0430  6.54 0.281
## # … with 73,770 more rows
Let’s look at the top five bigrams by tf-idf in the case of four random articles
g <- article_words_tfidf %>%
filter(id == 107 | id == 175 | id == 288 | id == 631) %>%
group_by(id) %>%
top_n(5, tf_idf) %>%
ungroup %>%
mutate(id = as.factor(id),
bigram = reorder_within(bigram, tf_idf, id)) %>%
ggplot(aes(bigram, tf_idf, fill = id)) +
geom_col() +
facet_wrap(~id, nrow = 2, scales = "free_y") +
coord_flip() +
theme(legend.position = "none") +
scale_x_reordered()
Group 1
Adebisi, Jongho, Maria, Keiho
Group 2
Charunan, Poojani, Abdul, Satoshi
Group 3
Oscar, Tsukumo, Jiyoung, Nicholas
Group 4
Alessandro, Shaunna, Jonathan, Rachel
Bone, Frederique, and Daniele Rotolo. 2020. “Text mining historical sources to trace technological change. The case of mass production.” Working Paper.
Ciarli, Tommaso, and Ismael Ràfols. 2019. “The relation between research priorities and societal demands: The case of rice.” Research Policy 48 (4): 949–67. https://doi.org/10.1016/j.respol.2018.10.027.
Feldman, Ronen, and James Sanger. 2006. The Text Mining Handbook. Cambridge, United Kingdom: Cambridge University Press. https://doi.org/10.1017/cbo9780511546914.
Kim, Namil, Hyeokseong Lee, Wonjoon Kim, Hyunjong Lee, and Jong Hwan Suh. 2015. “Dynamic patterns of industry convergence: Evidence from a large amount of unstructured data.” Research Policy 44 (9): 1734–48. https://doi.org/10.1016/j.respol.2015.02.001.
Kwartler, Ted. 2017. Text Mining in Practice with R. Chichester, United Kingdom: John Wiley & Sons Ltd. https://doi.org/10.1002/9781119282105.
Lansdall-Welfare, Thomas, Saatviga Sudhahar, James Thompson, Justin Lewis, FindMyPast Newspaper Team, and Nello Cristianini. 2017. “Content analysis of 150 years of British periodicals.” Proceedings of the National Academy of Sciences of the United States of America 114 (4): E457–65. https://doi.org/10.1073/pnas.1606380114.
Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, et al. 2011. “Quantitative analysis of culture using millions of digitized books.” Science 331: 176–82. https://doi.org/10.1126/science.1199644.
Silge, Julia, and David Robinson. 2017. Text mining with R: A tidy approach. Sebastopol, CA: O’Reilly Media.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc.